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Abstract 

Highly parallel computers with tens of thousands of processors will 
be of rapidly growing importance for highspeed computation. Paral- 
lel programs for these machines should be machine- independent, i.e., 
independent of properties that are likely to differ from one parallel 
computer to the next. In particular, parallel programs should be inde- 
pendent of: 

1. memory organization and communication network, 

2. number of physical processors available, 

3. control mode of the parallel computer (SIMD, MIMD, or MSIMD). 

This paper describes extensions of Modula-2 for writing highly par- 
allel, portable programs meeting these requirements. The extensions 
are: 

• Synchronous and asynchronous forms of a forall statement; 

• Control of the allocation of data to processors. 

Sample programs written with the extensions demonstrate the clar- 
ity of parallel programs when machine-dependent details are omitted. 

The principles of efficiently implementing the extensions on SIMD, 
MIMD, and MSIMD machines are discussed. The extensions are small 
enough to be integrated easily into other imperative languages. 

•Supported by Cooperative Agreement NCC 2-387 between the National Aeronau- 
tics and Space Administration (NASA) and the Universities Space Research Association 
(USRA). 

r Authors’ permanent address: University of Karlsruhe, D-7500 Karlsruhe, FRG. 
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1 Introduction 


Highly parallel machines with thousands and tens of thousands of proces- 
sors are now being manufactured and used commercially. These machines 
will be of rapidly growing importance for highspeed computation. They 
also indicate that a fundamental paradigm shift from the sequential to the 
parallel computer is in progress. This shift is fundamental because it affects 
virtually all areas of computer science, computer engineering, and computer 
applications. 

Ease of programming will be of overwhelming importance for the accep- 
tance of highly parallel machines. At present, writing highly parallel pro- 
grams is still a poorly understood and extremely complicated craft. What 
makes highly parallel programs difficult to write and maintain is that they 
must deal with a plethora of machine-dependent details such as the memory 
organization and interconnection network, the number of processors avail- 
able, and whether the target machine runs in SIMD or MIMD mode. To 
make parallel programs easier to write, maintain, and port, parallel pro- 
gramming languages must abstract from machine-dependent details and al- 
low programs to be formulated in a problem-oriented way. 

The current programming style for machines such as the 65,000-processor 
Connection Machine[3,7] is best characterized as “interconnection program- 
ming”. This style involves exploiting details about the interconnections 
among processors and memory units for squeezing the last bit of perfor- 
mance out of the available hardware. Interconnection programming has the 
same undesirable properties as assembly programming: Programs are diffi- 
cult to understand and maintain, and have to be rewritten for every new 
machine type. The rational for interconnection programming is that the 
communication networks of today’s parallel computers are critical bottle- 
necks, much too slow compared to the speed of the processors. However, 
given the youth of the field, parallel memory organization and compiler 
technology are likely to improve significantly, and might render interconnec- 
tion programming as obsolete as assembly programming. It therefore seems 
appropriate to design programming languages in which details about the 
memory organization and interconnection network are irrelevant. Instead, 
programs should simply exchange data by reading and writing memory in 
parallel, while fast interconnection hardware and compiler technology im- 
plement efficient data transport. The goal is to let the problem dictate the 
data exchanges, and not a particular computer architecture. 

A related machine dependence involves the number of physical proces- 
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sore. On most parallel machines today, programmers are repeatedly faced 
with the problem of simulating a large number of parallel threads of control 
on a comparatively small number of real processors. The resulting programs 
are extremely difficult to understand, because the code for multiplexing the 
processors and for packaging and shifting the data accordingly may obscure 
even simple algorithms. Instead, the problem and the algorithm should dic- 
tate the number of processes to be used, and the underlying runtime system 
should organize the allocation of data and processes to real memories and 
processors. 

A third issue when programming parallel machines is whether they ex- 
ecute in the modes MIMD, SIMD, or MSIMD. MIMD stands for multiple 
instruction streams, multiple data streams and means that each hardware 
processor has its own instruction pointer, executing its own program on its 
own data. Processors run independently of each other, except when syn- 
chronizing or exchanging data. SIMD stands for single instruction stream, 
multiple data stream and means that all processors execute the same in- 
structions in synchrony on their own data, or idle for some instructions. An 
SIMD machine consists of a single control processor and a large number of 
processing elements. The control processor stores the program and issues 
the instructions to the processing elements. Because of the synchronous ex- 
ecution and the elimination of many race conditions, an SIMD machine is 
easier to program than an MIMD machine. An SIMD machine also costs 
less to build than an MIMD machine. First, it needs less memory, be- 
cause the program is stored only once. Second, the processing elements are 
simple arithmetic and logic units without program counters, and therefore 
cheaper to build than full-fledged, general-purpose CPUs. These savings 
are important for machines that incorporate tens of thousands of processing 
elements. The drawback is that an SIMD machine may be difficult to utilize 
fully: Whenever an instruction is issued, only a portion of the processing 
elements may actually be in a state where they can execute it; the rest of 
them idle. 

A compromise between MIMD and SIMD is MSIMD, short for multiple 
SIMD. An MSIMD machine is similar to an SIMD machine, except that the 
single controller is replaced by several, each of which may issue a different 
stream of instructions. The processing elements can choose dynamically 
which instruction stream to follow. The underlying assumption is that, al- 
though a parallel program may branch out into several independent threads 
of control, the number of such threads is much smaller than the number of 
processing elements. For instance, two branches of an IF-statement could 
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be executed simultaneously on an MSIMD machine with two controllers, 
while an SIMD machine would first idle one set of the processing elements, 
then the other. MSIMD may also be viewed as VLIW SIMD, or Very-Large- 
Instruction-Word SIMD. 

It is evident that writing a program explicitly for an MIMD, SIMD, or 
MSIMD machine is another source of machine-dependence. For example, a 
program written for an MIMD computer such as the N-Cube will normally 
not run on an SIMD computer such as the Connection Machine, and vice 
versa. To preserve portability, parallel programs should be written in such a 
way that the synchronous or asynchronous parallelism is determined by the 
problem at hand, not dictated by a particular machine architecture. It is the 
task of the compiler to map synchronous or asynchronous parallelism to the 
capabilities of the available hardware. The synchronous and asynchronous 
language constructs presented below can be executed efficiently on MIMD, 
SIMD, and MSIMD architectures. 

The rest of the paper discusses Modula-2*, an extension of Modula- 
2 [8] for writing highly parallel programs. The extensions abstract from 
the memory organization and the number of physical processors, and let 
the programmer choose explicitly between synchronous and asynchronous 
execution. The necessary extensions were surprisingly small. We chose 
Modula-2 as a base, because we wanted to start experimenting with a sim- 
ple language. Similar extensions can be integrated into other imperative 
programming languages, such as C++ or Ada. We also discuss the prin- 
ciples of how to implement the constructs on MIMD, SIMD, and MSIMD 
machines. 

2 Parallel Programming Constructs 

The extensions of Modula-2 consist of synchronous and asynchronous ver- 
sions of a forall statement, plus a simple, optional declaration for mapping 
array data onto processors. Furthermore, the restrictions in Modula-2 on 
compile- time evaluation of array bounds had to be lifted, to allow for flexible, 
parallel array processing. 

For presenting the syntax of the extensions, we use the EBNF notation of 
the Modula-2 language definition[8], with keywords in upper case, | denoting 
alternation, (...) grouping, and [ . . . ] optionality of the enclosed sentential 
form. = ! "" 
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2.1 Overview of the forall statement 

The forall statement creates a set of processes that execute in parallel. In 
the asynchronous form, the individual processes operate independently; the 
forall simply terminates when the last of the created processes terminates. 

In the synchronous form, the processes created by the forall operate in uni- 
son, but may branch out into mutually independent subsets and then rejoin. 
Although the asynchronous form is the more general one, the synchronous 
form is easier to understand because it causes fewer race conditions, just as 
a clocked hardware circuit causes fewer race conditions than an unclocked 
one. Where necessary, explicit synchronization of asynchronous processes is 
possible with semaphores and the procedures SEND and WAIT, as specified 
(though not as implemented) in Chapter 30 of [8], 

The syntax of the forall is as follows. 

ForallStatement = FORALL ident V SimpleType IN (PARALLEL | SYNC) 

S tatementSequence 

END 

The identifier introduced by the forall statement is local to the state- 
ment and follows the usual scope rules. “SimpleType” is an enumera- 
tion or a subrange of another enumeration. The basic enumerations IN- 
TEGER, CARDINAL, LONGINT, CHAR, and BOOLEAN as well as any 
user-defined enumeration may be used. 

2.2 The asynchronous forall 

The action of the asynchronous forall statement 
FORALL C : T IN PARALLEL SS END 
is as follows. 

1. Assume the number of values of type T is N. The statement creates 
N processes, each supplied with a constant C bound to a unique value 
of T. 

2. The N processes execute the statement sequence SS concurrently. No 
assumptions about the relative speeds of the processes may be made, 
unless explicitly synchronized. The statements in SS may refer to C 
or any other identifier global to the statement. If several processes 
write the same global variable, then it is indeterminate which value is 
eventually stored in it. 
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3. The forall statement terminates when the last of the N created pro- 
cesses terminates. 

In the following simple example, an asynchronous forall statement im- 
plements a vector addition. 

FORALL i: [0..N-1] IN PARALLEL z[i] x[i] + y[i] END 

Since no two processes created by the forall access the same variable, no 
temporal ordering of the processes is necessary. The N processes may exe- 
cute at whatever speed. The forall terminates when all processes created 
by it have terminated. 

Our asynchronous forall is a simplification of the forall statement found 
in the dataflow languages VAL and SISAL[6,5]. It can express the same 
degree of explicit parallelism as its dataflow variants. However, dataflow 
machines can also exploit implicit parallelism, by detecting at runtime that 
certain subexpressions are independent, and then executing these subexpres- 
sions in parallel. A parallel machine constructed out of numerous, individual 
dataflow processors might be able to exploit this type of implicit parallelism. 

2.3 The synchronous forall 
The synchronous forall statement 
FORALL C : T IN SYNC SS END 

differs from the asynchronous form only in that the created processes execute 
the statement sequence SS synchronously. Roughly stated, synchronous ex- 
ecution means that all processes that follow the same path through the 
control flow graph execute instructions in lock step. However, processes on 
differing control flow paths may execute asynchronously. This scheme is not 
SIMD, since control flow may diverge in conditional statements. For exam- 
ple, consider the synchronous execution of an if statement with two arms. 
First, all processes evaluate the condition synchronously. The evaluation 
splits the set of processes into two subsets, depending on the result of the 
condition evaluation. The subset with processes containing the value TRUE 
then executes one arm s ynch ronously, the other subset the other arm. Both 
sets may operate simultaneously. Though processes in the same subset op- 
erate in lock step, the speed of processes in different arms are incomparable. 
When both subsets terminate, they are joined again into one set. 
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The synchronous forall statement is a generalization of the forall for 
SIMD machines described by Hillis and Steele[4]. MSIMD machines can 
execute our synchronous forall directly. SIMD machines can also implement 
it efficiently, because there is no order implied among diverging control flow 
branches. The lack of ordering permits a process scheduling that greatly 
reduces the idling of processors. (See Section 3.2 for more details.) 

Below is the precise definition of synchronous execution. The definition 
is recursive and given for each statement type. 

Sequence: A statement sequence of the form 
Ti;T2; ... Tk 

is executed synchronously by executing the statements Ti synchro- 
nously in sequence. 

Assignment: An assignment statement of the form 
L :- R 

is executed synchronously by N processes as follows. First, all N 
processes evaluate the designator L synchronously, yielding N (not 
necessarily different) results each designating a variable. Second, all 
N processes evaluate the expression R synchronously, yielding N (not 
necessarily different) values. Third, all processes store their values 
computed in the second step into their respective variables computed 
in the first step. If the third step results in several values being stored 
into the same variable, then it is indeterminate which of those values 
will actually be stored after the assignment terminates. 

if: An if statement of the form 

IF El THEN TT1 
ELSIF E2 THEN TT2 

ELSE TTk 
END 

is executed synchronously by N processes as follows. First, all N 
processes evaluate expression El synchronously. Those processes for 
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which El evaluates to TRUE then execute TTl synchronously, while 
the other processes execute E2 synchronously. Those whose evaluation 
of E2 yields TRUE then execute TT2 synchronously, and so on. Thus, 
each IF and ELSIF clause divides the set of remaining processes into 
two independent subsets. The processes remaining after the last EL- 
SIF clause (if any) finally execute TTk synchronously. No assumptions 
may be made about the relative speeds of pairs if processes executing 
different expressions Ei or statement sequences TTi. The synchronous 
execution of the if statement terminates when the last non-empty sub- 
set of processes terminates. 

while: A while statement of the form 
WHILE E DO TT END 

is executed synchronously by N processes as follows. Assume processes 
may be designated either as active or inactive. 

1. Designate all N processes as active. 

2. All active processes execute expression E synchronously. Those 
processes, whose evaluation of E yields false, are designated as 
inactive. 

3. If the set of active processes is empty, then the synchronous exe- 
cution of the while statement terminates. 

4. Otherwise, the active processes execute statement sequence TT 
synchronously. 

5. Continue with step 2. 
forall: A forall statement of the form 

FORALL D : U ... TT END 

is executed synchronously by N processes as follows. 1 First, all N 
processes compute the range U in synchrony. Then each of the N 
processes spawns a new set of processes given by U. If the forall 
specifies synchronous execution, all processes thus created execute the 

'This is a synchronous or asynchronous forall nested within another, synchronous 
forall 


8 


statement sequence TT synchronously; otherwise, they execute asyn- 
chronously. Synchronous execution of the forall terminates, when all 
created processes have terminated. 

WAIT and SEND: Synchronous execution of a WAIT by N processes 
causes all N processes to block if any of them blocks. If the N processes 
have been blocked, they will continue only after all individual processes 
that caused the blocking have been unblocked by a SEND from other 
processes. Clearly, the N processes cannot unblock themselves. 

The synchronous execution of expressions, designators, procedure calls, 
case statements, repeat statements, for statements, loop statements, with 
statements, return statements, and exit statements is defined analogously. 
Of special importance are procedure and function calls, because they allow 
multiple, synchronous subprogram invocations. The definitions are omitted 
here for the sake of brevity. 

2.4 Example 

Consider the problem of summing the elements of a vector in parallel. By 
using a recursive doubling technique, the sum can be computed in O(log N) 
time, where N is the length of the vector. Figure 1 illustrates the process. 

The recursive doubling technique operates basically by computing partial 
sums of length 2, 4, 8, ... N. There is a one-to-one mapping between process 
numbers and elements of the vector. By inspecting the assignment statement 
we note that only process i will update the i’th element of the vector. In the 
first iteration, all odd numbered processes are disabled by the if statement, 
since that statement has no second arm. Thus, only the even numbered 
processes update their respective vector elements. Each of those processes 
does so by retrieving the element to the right of V[i] and adding it to V[t]. 
(The second condition in the if statement makes sure that the last process 
does not attempt to access a non-existing vector element.) In the next 
iteration, only the processes divisible by 4 will update their values, but this 
time they reach for elements that are a distance of 2 away. These are the 
even numbered elements. Note that these elements already contain sums of 
subvectors of length 2. The result is that now the updated array elements 
contain partial sums of length 4. This process continues by doubling the 
length of the partial sums in each step, until V[0] contains the desired result. 
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Figure 1: Computing the Sum of a Vector 


VAR V : ARRAY [0 . . N] OF REAL; 

VAR stride: CARDINAL; 

BEGIN 

stride :* 1; 

WHILE stride <= N DO 

FORALL i : [0 . . N] IN PARALLEL 

IF ((i MOD (stride*2))=0) AND ((i+stride)<=N) THEN 
V[i] :* V[i] + V[i+8tride] 

END 

END; 

stride :* stride * 2 
END (* sum in V[0] *) 

END 


Note that the process selection is such that in each iteration, none of 
the processes interfere. Each process reads and writes its own pair of vector 
elements. Thus, we can use the asynchronous form of the forall statement. 
The only requirement is that all processes complete before the next iteration 
commences, but that property is assured by the semantics of the forall. 
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For illustrating the use of the synchronous forall, consider interchanging 
the while and forall statements in the above program. How would that 
change the execution of the program? First, each process would now control 
its own loop, so the loop control variable stride must be replaced by an 
array. Furthermore, the individual processes must now be constrained to 
execute synchronously. Otherwise, we would obtain unpredictable results, 
because the processes may overtake each other arbitrarily. For instance, 
one process might read a vector element that has not yet been updated, or 
it might overwrite a vector element whose old value is still needed. The 
synchronous forall guarantees that no such interference can happen. The 
resulting program is below. 

VAR V : ARRAY [0 .. N] OF REAL; 

VAR stride: ARRAY [0 . . N] OF CARDINAL; 

BEGIN 

FORALL i : [0 . . N] IN SYNC 
stride [i] :* 1; 

WHILE stride [i] <* N DO 

IF ((i HOD (stride [i] *2) ) *0) AND (Ci+stride[i])<=N) THEN 
V[i] := V[i] + V[i+stride[i]] 

END 

END; 

stride [i] :* stride [i] * 2 
END (* sum in V[0] *) 

END 

This program could be transformed again in such a way that not all processes 
execute the loop for the same number of steps. Merging the condition of the 
if statement into the condition of the while statement would stop each loop 
at the right time. Yet another transformation would use N semaphores to 
control the summing process asynchronously. 

2.5 Allocation of array data 

Co-location of data with the processors that operate upon them is impor- 
tant for parallel machines without uniform access time to memory locations. 
Poor alignment of data and processors may cause excessive communication 
overhead. We therefore provide a simple construct for controlling the allo- 
cation of array data to the available processors. This construct is optional, 
and does not change the meaning of a program; it affects only performance. 
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A compiler for a machine with uniform memory access time may ignore the 
construct. 

The allocation of array data to processors is controlled with one allocator 
per dimension. The modified declaration syntax for arrays is as follows: 

ArrayType = ARRAY SimpleType [allocator] 

SimpleType [allocator]} OF type 

allocator = LOCAL | SPREAD | SCATTER 

If the allocator is missing, it is assumed to be SPREAD. The interpre- 
tation of the allocator is as follows. 

LOCAL: Allocate all elements of a dimension marked LOCAL to a single 
processor. 

SPREAD: Distribute the elements of a dimension marked SPREAD over 
all available processors. Elements whose indices differ only by unity in 
the marked dimension must be allocated to the same processor, as far 
as that is possible given that all available processors should be used. 

SCATTER: Distribute the elements of a dimension marked SCATTER 
over the available processors. Elements whose indices differ only by 
unity in the marked dimension must be allocated to different proces- 
sors. 

As an example, consider the following declarations. 

A: ARRAY [1..L] SPREAD [1..M] LOCAL OF T 
B: ARRAY [1..L] SCATTER [1..M] LOCAL OF T 

The LOCAL allocator forces each row of A and B into one processor. If 
the number of available processors, P, is larger than L, then each row is 
allocated to exactly one processor in both cases. If 1 < P < L, then row 
t of A is allocated to processor ((r — 1) 4- [L/P]), while row r of B is 
allocated to processor ((r - 1) mod P). Thus, SPREAD assigns sequences 
of \LjP~\ successive rows to a single processor, while SCATTER distributes 
these sequences over the P processors. 

For a multidimensional array, there is at most one dimension where the 
difference between SPREAD and SCATTER matters. Consider a multidi- 
mensional array C with n dimensions. 
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C: ARRAY [ln..un] allocn ... 


[11.. ul] alloc 1 OF T 


f 1 if alloci = LOCAL 

1 ui — li+1 otherwise 

f nU * if * < < 

1 1 otherwise 

Determine the largest m(l < m < n) such that 

P < D(m,n ) 

where P is the number of available processors. If no such m exists (i.e., 
P > D(l,n)), then there are enough processors to distribute all elements of 
dimensions marked SPREAD or SCATTER to different processors. (There 
is no difference between SPREAD and SCATTER in this case.) If m ex- 
ists, it identifies the dimension where the difference between SPREAD and 
SCATTER applies. If the allocator of that dimension is SPREAD, then 
the array elements C[j n , . . . , j m , . . .] and C\j n , , (j m ± 1),...] must be 
allocated to the same processor, as far as that is possible, given that all 
available processors should be used. If the allocator is SCATTER, then any 
two such array elements must be allocated to different processors. 2 

Dimensions higher than m that are marked SPREAD or SCATTER are 
simply distributed over the available processors. Dimensions lower than m 
are automatically treated as LOCAL, since there are no additional proces- 
sors available to distribute the data. An implementation may also map the 
dimensions m and lower into one dimension (i.e., “unroll” them into one, 
long vector) and then treat the new dimension according to the allocator of 
dimension m. 

The function F defined below provides a suitable mapping of elements 
C[jnt ■ • ■ ,ii] to processors numbered 0...P— 1. Many other choices are 
possible, depending on the interconnections among the processors. 


di = 
D(k,l) = 


F(m,j n ,...,ji) = 


G(m,j n , . . . ,ji) mod P 

if allocm = SCATTER 
,ji) -r \D(m,n)/Pl 
if allocm = SPREAD 


2 If P = D(m), there is again no difference between SPREAD and SCATTER. 
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G(m,j n ,...,j i) = ^2 D(m,i- 1) x 5(t) x (j,- - u<) 

i=m 

_ f 0 if alloci = LOCAL 
^ ' — | 1 otherwise 

Function F can even be used in the case where P > D(l,n), by setting 
m = 1. 

The SPREAD allocator is used to minimize communication overhead in 
case of nearest-neighbor communications, while still utilizing all available 
processors. The SCATTER allocator can keep processor utilization high 
if segments of an array are not being processed, as for example in LU- 
decomposition. 

Callahan and Kennedy[l] have made a different proposal for the distri- 
bution of array data. In their proposal, programmers must provide explicit 
mapping functions for array indices to processor numbers. In our design, 
these mapping functions are created automatically from much simpler al- 
locators, while keeping the program independent of the number of physical 
processors. On the Connection Machine, the default allocation is equiva- 
lent to SPREAD. LOCAL or SCATTER allocations must be programmed 
explicitly. 

2.6 Other extensions of Modula-2 

The original definition of Modula-2 in reference [8] places several restrictions 
on arrays. The first concerns open arrays. An open array is an array without 
declared bounds. Open arrays are essential for subprograms that operate 
on arrays whose size is unknown until runtime. Modula-2 allows only one- 
dimensional, open arrays. For convenient handling of higher-dimensional 
arrays, open array types should be allowed to have more than one dimension. 
Multi-dimensional, open arrays are actually proposed for the ISO-standard 
of Modula-2[2]. 

Another troublesome restrictions involves compile-time constants. For 
example, the forall statement uses subranges, whose bounds, according to 
the original language definition of Modula-2, would have to be compile- 
time constants. This restriction is inappropriate for array parameters whose 
array bounds are not known until runtime. Similarly, it is often necessary 
to create temporary, local arrays whose size is determined by the size of 
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another array that is passed as a parameter. We therefore propose that 
constant expressions are evaluated at runtime. When a constant expression 
is used in a constant declaration, the expression is evaluated and used to 
initialize the constant. No assignments to constants are permitted. When 
used as an array bound, a constant expression is evaluated when the array 
is allocated; the array bounds remain unchanged for the lifetime of the 
airay. Similarly, when used in a subrange of a forall statement, a constant 
expression is evaluated once and used to determine the number of processes. 
The constant expression is not reevaluated for the duration of the forall 
statment. 

As an example, consider the procedure Count. Count returns the num- 
ber of bits in a bitvector that have the value TRUE. A possible solution is 
to sum a vector that is initialized to 1 where the bitvector has the value 
TRUE, and to 0 elsewhere. This vector must be allocated at runtime, since 
it is unknown what size to choose at compile time. 3 It would be wasteful and 
awkward to require that the caller provide the array. Count is illustrated 
below. 

PROCEDURE Count (bits: ARRAY OF BOOLEAN): CARDINAL; 

VAR temp : ARRAY [0 . . HIGH (bits)] OF INTEGER; 

VAR stride: CARDINAL; 

BEGIN 

FORALL i : [0 . . HIGH(bits)] IN PARALLEL 
IF bits[i] THEN temp[i] := 1 
ELSE temp[i] :■ 0 

END 

END 

(* Nos compute the sum of elements of temp with *) 

(* recursive doubling, as described earlier. *) 

RETURN temp[0]; 

END Count 

Another restriction that can be lifted is that set types must have a base 
type with a small cardinality, for example the wordlength of a computer. 
With highly parallel machines, there is no rational for such a severe restric- 
tion. Instead, sets should be allowed to be as large as memory permits. Of 

3 Summing the bitvector itself would not work unless each bit occupies a follow word. 
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course, an implementation is free to pack a set type densely into memory 
words. 

3 Implementation of the forall Statement 

We consider implementing the synchronous and asynchronous forms of the 
forall statement on both synchronous and asynchronous architectures. Par- 
ticular emphasis is on how to simulate a large number of processes, p, that 
potentially exceeds the number of physical processors, P. We assume p > P 
in the following. 

3.1 Process-to-processor assignment 

An efficient assignment of processes to processors is important when there 
are thousands of processors. The assignment can be performed statically by 
the compiler, or dynamically by the runtime system. A static assignment 
has the advantage of eliminating queues of ready processes. The overhead 
for managing these queues might easily exceed the actual work to be done 
in a fine-grained parallel algorithms such as those presented earlier. On the 
other hand, a poor static assignment might not use the available processors 
well. Clearly, any reasonable process assignment must take the allocation of 
data to processors and the communication network into account. 

As an example, consider the following program fragment. 

A: ARRAY [1. .q] SPREAD OF T; 

B: ARRAY [1 . .r] SPREAD OF T; 

FORALL i: [l..p] IN PARALLEL A[el(p)] := B[e2(p)] END 

where el(p) and e2(p) are expressions in p. Without any further assumption 
about these expressions and the relations among p, q, and r, a reasonable 
assignment is to spread the p processes over the same processors that are 
available to the larger of the arrays A and B. Assume these are P processors. 
Let v = f p/P ] . Processor i would then execute processes vi, ... , v(i + 1) — 1 
in sequence. 

Note that the process-to-processor assignment may actually change from 
statement to statement within a single forall, depending on what data struc- 
tures are being accessed. Control may therefore jump processors from one 
statement to the next, or even within a statement. The code produced by a 
compiler optimizing for memory references may therefore be quite different 
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from the traditional method of rescheduling a process onto a potentially dif- 
ferent processor only at synchronization events. Much research in process 
management on highly parallel machines remains to be done. 

Many interconnection networks can treat certain communication pat- 
terns better than others. For instance, on a hypercube network, near- 
neighbor communication in a n-dimensional grid and communication over 
distances that are powers of 2 can be treated more efficiently than others. 
Suppose the index expressions in the above program fragment are linear in 
p, i.e., of the form c\ x p+ C 2 , where ci and C 2 are constants, perhaps even 
powers of 2. In those situations, efficient communication instructions can 
be chosen by a compiler optimizing for a hypercube network. Compilers for 
other communication networks may be able to exploit other special cases. 
Clearly, future research in optimizing compilers must address the problem 
of minimizing communication time in highly parallel machines. 

3.2 Implementation of the asynchronous forall 

The asynchronous forall is easy to implement. Since no assumptions about 
the relative speeds of the processes can be made, an implementation is free 
to choose any order, for example a fully asynchronous, fully synchronous, a 
vectorized, a sequential, or even a random order. 

Recall that the asynchronous forall statement does not terminate until 
all created processes terminate. Thus, on a MIMD machine, all processes 
must perform a synchronization step at the end. An asynchronous reduction 
tree similar to the one described for summing the elements of a vector can be 
used to avoid linear synchronization time when processes terminate nearly 
simultaneously. On an SIMD machine, no such synchronization is necessary. 
On an MSIMD machine, only the controllers need to synchronize. 

An important issue on an SIMD machine is how to fully use the avail- 
able processors. Recall that whenever control flow splits into two or more 
branches, only one set can be fed instructions at a time, while the other sets 
idle. 

To avoid the idling of processors, more sophisticated scheduling algo- 
rithms are possible. The goal is to assign processes to processors in such 
a fashion that nearly all processors remain busy. This goal can be accom- 
plished with a simple rescheduling at every branchpoint. For example, con- 
sider an if statement with two arms, and assume that each of the P physical 
processors is assigned v processes. The usual simulation is to feed both arms 
of the if statement v times to the processors, effectively idling half of the 
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processors. Instead, consider the following scheme. First, each processor 
evaluates the condition for all its assigned processes and divides them into 
two sets, depending on the results. Next, all P processors select processes 
with value TRUE an then receive the instructions for the corresponding arm 
from the controller. When all processors are done with processes containing 
the value TRUE, the controller switches to the other arm. For sufficiently 
high v and evenly distributed values of the conditions, few processors will 
actually idle, achieving nearly full utilization. Note that such a scheduling 
is valid because the asynchronous forall makes no assumptions about the 
relative order of the two arms of the if statement. 

On an MSIMD machine, several branches can be executing in an over- 
lapped fashion, until the number of parallel branches exceeds the number 
of available controllers. Up to that point, all processors can be kept busy. 
After that, the remaining branches are executed in SIMD fashion, possibly 
with rescheduling as discussed above. 

Finally, the asynchronous forall can be executed efficiently on vector 
computers, since order is immaterial. For simulation on sequential ma- 
chines, all that needs to be done is to replace the forall statement with a for 
statement over the same range. Note, however, that such a simple-minded 
simulation may mask many potential programming errors. Perhaps a ran- 
domized order of the processes is more appropriate in helping programmers 
with detecting errors when testing parallel programs on sequential machines. 

3.3 Implementation of the synchronous forall 

Clearly, a MSIMD machine can implement the synchronous forall state- 
ment directly. When the number of parallel branches exceeds the number of 
controllers, the remaining branches must be simulated by multiplexing the 
available processors, as is done on SIMD machines. Since separate control 
flow branches execute asynchronously, efficient process scheduling is possi- 
ble, as explained in the previous section. 

Achieving synchronous execution on an MIMD machine can be expen- 
sive. A simple, but inefficient simulation would be to insert a synchroniza- 
tion command after every instruction. Then each instruction would essen- 
tially take time proportional to the logarithm of the available processors, 
which would slow down all programs significantly. Fortunately, a synchro- 
nization command is not needed for every instruction, but only before and 
after every memory write. This synchronization suffices because processes 
are affected by other processes only through changes in memory. N evert he- 
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less, a modest amount of hardware for simulating SIMD, such as a global 
clock for instruction execution, would provide a much more efficient imple- 
mentation of synchronous execution. 

If the number of processes exceeds the number of processors, it is impor- 
tant that the multiplexing does not violate the semantics of the synchronous 
forall. Consider the following statement. 

FORALL i: [ ] IN SYNC x[i+l] :* x[i] END 

If process i actually executes before process i + 1, a naive implementation 
would produce incorrect results, even on a SIMD machine. Instead, all pro- 
cesses have to follow faithfully the steps in the definition of the synchronous 
execution of the assignment statement. Each process must first evaluate the 
left and right sides of the assignment before any write to memory takes place. 
This means that each process must save both a pointer and a value and wait 
for all processes to complete their evaluation of the left and right hand sides 
before the actual assignment. Since evaluating the right and left sides might 
cause side effects (through function calls, for instance), these computations 
must be carried out such that they appear synchronously even if there are 
more processes than processors. 

Correct, synchronous execution requires overhead in space and time. 
This overhead can be reduced on both SIMD and MIMD architectures if 
the synchronous forall can be transformed into the asynchronous form with 
no or infrequent synchronization. 

4 Conclusions 

We have presented simple language constructs for writing highly paral- 
lel, machine-independent programs. These constructs can be implemented 
efficiently on SIMD, MSIMD, and MIMD computers. Simple, machine- 
independent control over the mapping of data to processors allows compilers 
to optimize communication time on architectures with distributed memory. 
Work on optimizing compilers for a highly parallel machine is in progress. 
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